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Abstract 

We propose an ^i-regularized likelihood method for estimating the inverse covari- 
ance matrix in the high-dimensional multivariate normal model in presence of missing 
data. Our method is based on the assumption that the data are missing at random 
(MAR) which entails also the completely missing at random case. The implementa- 
tion of the method is non-trivial as the observed negative log-likelihood generally is a 
complicated and non-convex function. We propose an efficient EM-algorithm for opti- 
mization with provable numerical convergence properties. Furthermore, we extend the 
methodology to handle missing values in a sparse regression context. We demonstrate 
both methods on simulated and real data. 



1 Introduction 



The most common probability model for continuous multivariate data is the multivariate 
normal distribution. Many standard methods for analyzing multivariate data, including 
factor analysis, principal components and discriminant analysis, are directly based on the 
sample mean and covariance matrix of the data. 

Another important application are Gaussian graphical models where conditio nal depen- 
denc ies among the variables are entailed in the inverse of the covariance matrix (jLauritzenl . 
1996l l. In particular, the inverse covariance matrix and its estimate should be sparse having 
some entries equaling zero since these encode conditional independencies. 

In the context of h igh-dimensional data where the nu mber of variables p is much larger 
than sample size n, Meinshausen and Biihlmann ( 20061 ) estimate a sparse Gaussian model 
by pursuing many £i-penalized procedures for every node in the graph and they prove that 
the procedure can asymptotically recover the true graph. Later other aut hors proposed 
algori th ms for the exa c t opt ii nization of the - ^i -penal ized l og-likelihood (lYuan and Lin 
(|2007l ). iBaneriee et al\ (l2008l). [Friedman et al\ (|2007bl ) and iRothman et al\ (|2008l )).~lt 
has been shown in lRavikumar et al\ (j2008l ) that such an approach is also able to recover 
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asymptotically the true graph, but Meinshausen ( 20081 ) points out that rather restric- 
tive conditions on the true covariance matrix are necessary. All these approaches and 
theoretical analyses have so far been developed for the case where all data is observed. 

However, datasets often suffer from missing values ([Little and Rubin . Il987l ;i. Besides many 
ad- hoc approaches to the missing- value proble m, there is a syste r natic a pproach b ased on 
likelihoods which is very popular nowadays (jLittle and RubinI (119871 ). ISchafeii (119971 )). 
But even estimation of mean values and covariance matrices becomes difficult when the 
data is incomplete and no explicit maximization of the likelihood is possible. A solution 
addressing this problem is given by the EM-algorithm for solving missing-data problems 
based on likelihoods. 

In this article we are interested in estimating the inverse covariance matrix in the high- 
dimensional multivariate normal model in presence of missing data. We present a new 
algorithm for maximizing the £i-penalized observed log-likelihood. The proposed method 
can be used for estimation of sparse undirected graphical models or/and regularized co- 
variance estimation for high-dimensional data where p >> n. Furthermore, once having 
a regularized covariance estimation for the incomplete data at hand, we show how to do 
^i-penalized regression, when there is an additional response variable which is regressed 
on the incomplete data. 



2 ^i-regularized inverse covariance estimation with missing 
data 

2.1 GLasso 



Let {X^-^\ . . . , X^P^) be Gaussian distributed with mean n and covariance S, i.e. Af{^, S). 
We wish to estimat e the concentra ti on ra atrix K = Given a complete random sample 



X — (xi , . . . , Xn') 1 

log-likelihood 



Yuan and Lh] (|2007l 'l propose to minimize the negative £i -penalized 
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£{fi,K;^) + X\\K\\i = -^log \K\ + - ^{xi - fifK{xi - fi) + X\\K\\i, (2.1) 



over non- negative definite matrices K {K y 0), where \\K\\i = X]j j'=i l^jj'l- Here A > 
is a tuning parameter. 



The minimizer K is easily seen to satisfy 



K = argmin{-log \K\ + iv{KS) + 



(2.2) 



where S = ^ Z^Lil^i ~ ~ and p = 

Friedman et al. (|2007bl ) propose an elegant and efficient algorithm, called GLasso, to solve 
the prob lem (12.21). We briefly re view the derivation o f thei r algorithm while details are 



given m 



Friedman et"aZI (|2007bl ) and iBaneriee et al\ (|2008l ). We will make use of this 
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algorithm in the M-step of an EM-algorithm in a missing data setup, described in Sec- 
tion [2321 



Using duahty, formula ()2.2p is seen to be equivalent to the maximization problem 

S = argmax logdet(S). (2-3) 

|S-5||oo<P 

Problem (|2.3p can be solved by a block coordinate descent optimization over each row and 
corresponding column of S. Partitioning S and S 

^^O, S=( ] (2.4) 

the block solution for the last column (T12 satisfies 

(T12 = argmin y^ll^ly. (2.5) 

y-\\(y-si2)\\ac<p 

Using duality it can be seen that solving (12. 5|) is equivalent to the Lasso problem 

/3 = argmin||^s}f /3 - i:-l'\si2\\l + pWPWi (2.6) 
13 2 

where ai2 and /3 are linked through (J12 = Sii/3/2. Permuting rows and columns so 
that the target column is always the last, a Lasso problem like (j2.6p is solved for each 
column, updat ing their estimate of E after each stage. Fast coordinate descent algorithms 
for the Lasso ( Friedman et al. . 2007al ) make this approach very attractive. Although the 
algorithm solves for S, the corresponding estimate of K can be recovered cheaply. 



2.2 MissGLasso 

We turn now to the situation where some variables are missing (i.e. not observed). 

As before, we assume {X^^\ . . . , X^^^) ~ N{p, S) be a p-variate normal distribution with 
mean n and covariance S. We then write x = (xofe^jX^js), where x represents a random 
sample of size n, -x-obs denotes the set of observed values, and x^is the missing data. Also, 
let 

^obs — {,Xobs,li Xobs,2i ■ ■ ■ iXobs,n)i 

where Xobs.i represents the set of variables observed for case z, i = 1, . . . , n. 

A simple way to estimate the concentration matrix K would be to delete all the cases 
which contain missing values and then estimating the covariance by solving the GLasso 
problem (12.21) using only the complete cases. However, excluding all cases having at least 
one missing variable can result in a substantial decrease of the sample size available for 
the analysis. When p is large relative to n this problem is even much more pronounced. 

Another ad-hoc method would impute the missing values by the corresponding mean and 
then solving the GLasso problem. Such an approach is typically inferior than what we 
present below, see also Section 14.1.11 
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Much more promising is to base the inference for /i and S (or K) in presence of missing 
values on the observed log-likelihood: 

n, 1 

S; 'X.obs) = const — — y ^ log |Sofe<j,i| — — / ^ {Xobs,i ~ lJ'obs,i) iJ^obs,i) {Xobs,i ~ l^obs,i) 
i=l 1=1 

(2.7) 

where ^obs/i and Sg^s j are the mean and covariance matrix of the observed components of 
X (i.e. Xobs) for observation i. Formally ()2.7p can be re-written in terms of K 

1 " 1 " 

£{fj,,K;yLobs) = const--'^log\{K~^)obs,i\-^'^{xobs,i-fJ-obs,i)'^{{K~^)obs,i)~^{xobs,i-tJ'obs,i)^ 

i=l 1=1 

(2.8) 

Inference for fi and K can be based on the log-likelihood (j2.8p if we assume that the 
underlying missing-data mechanism is ignorable. The missing-data mechanism is said to 
be ignorable if the probability that an observation is missing may depend on x^bs but not 
on Xmis {Missing at Random) and if the parameters of the data model ar id the parameters 



of the missingness mechanism are distinct. For a precise definition see iLittle and Rubin 
(Il987l ). 

Assuming that p is large relative to n, we propose for the unknown parameters (/i, K) the 
estimator: 

{il,K) = argmin -(penifJ; K;Xobs) (2.9) 

-ipenitJ-jK-^Xobs) = -KfJ-jK^Xobs) + (2.10) 

where K;Xobs) is given in (12. 8p . We call this estimator the MissGLasso. 

Despite the concise appearance of (|2.8p . the observed log-likelihood tends to be a com- 
plicated (non-convex) function of the individual fij and kjj', j,j' = 1, . . .p, for a general 
missing data pattern. Optimization of (j2.9p is therefore a non-trivial issue. An efficient 
algorithm is presented in the next section. 



2.3 Computation 

For the derivation of our algorithm presented in Section [2]32] we will state first some facts 
about the conditional distribution of the Multivariate Normal Model (MVN). 



2.3.1 Conditional distribution of the MVN Model and Conditional Mean 
Imputation 

Consider a partition {Xi,X2) ~ Af{fJ.,Ti). It is well known that follows a linear 

regression on X ^ with mean ^2 + S2iIl]j^^(Xi — /ii) and covariance S22 — S2iS^/Si2 
(jLauritzenl . [l99fil ). Thus, 

X2IX1 ~ ^^{^l2 + ^2i^uiXi - fii), S22 - S2iSn'Si2). (2.11) 
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Expanding the identity KTi = / gives following useful expression: 

K21 K22 ) ( S21 S22 ) " (0 / ) • ^^-^^^ 
Using p.l2p we can re-express (I2.11|) in terms of K: 

X2IX1 ~ M{fi2 - K^2^K2i{X, - fii),K^2^). (2.13) 

Formula (|2.13p will be used later in our developed EM-algorithm for estimation of the 
mean n and the concentration matrix K based on a random sample with missing values. 

The spirit of this EM-algorithm, see Section [2. 3. 2^ is captured by the following method of 
imputing missing values by conditional means due to iBuck (|l96fl): 



(1) Estimate (fJ-jK) by solving the GLasso problem ()2.2p using only the complete cases 
(delete the rows with missing values). This gives estimates fi, K. 

(2) Use these estimates to calculate the least squares linear regressions of the missing 
variables on the present variables, case by case: From the above discussion about 
the multivariate normal distribution, the missing variables of case i, Xfnis,ii given 
Xobs,i are normally distributed with mean 

IE[3^mjs,i |3^of)s,i) A*; — f^mis il^mis,mis) K^nisfibs {p^obs,i l-^obs) • 

Therefore an imputation of the missing values can be done by 



mts,i 



f^mis {,Kmis,mis) j^mis,obs {•^obs,i l-^obs) 



fj-obsj f^mis depend on case i. Furthermore, Kmis,mis denotes the sub-matrix of K 
with rows and columns corresponding to the missing variables for case i. Similarly 
Kmis,obs denotes the sub-matrix with rows corresponding to the missing variables 
and columns corresponding to the observed variables for case i. Note that we always 
notationally suppress the dependence on i. 

(3) Finally, re-estimate (/i, K) by solving the GLasso problem on the completed data in 
step (2). 



2.3.2 ^i-norm penalized likelihood estimation via the EM- Algorithm 



A conveni ent method for qptirn izing incomplete data problems like ()2.9p is the EM- 
algorithm (jPempster et al\ (119771 )). 



To derive the EM-algorithm for minimizing (12. Op we note that the complete data follows 
a multivariate normal distribution, which belongs to the regular exponential family with 
sufficient statistics 



Ti = ^Xii,^Xi2,...,^: 



'ip 



x^l 



i=l 



i=l 
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and 

/ l^i=l ^il 2^1=1 ^ilXi2 ■ ■ ■ 2^i=l XilXip i 

En 2 
i=l Xi2Xil Z^i=l ■ ■ ■ Z^i=l ^i2Xip „ 

= X X. 

The complete penalized negative log-likelihood ()2.ip can be expressed in term of the suf- 
ficient statistics Ti and T2: 

-e{fi,K;^) + X\\K\\i = -'llog\K\ + ^fi^Kf,-fi^KTi + hr{KT2) + X\\K\\^ (2.14) 

which is linear in Ti and T2. The expected complete penalized log-likelihood is denoted 
by: 

K\fi', K') = -E[£(/i, K; x) |xo6„ /i', K'] + \\\K\\i. 

The EM- algorithm works by iterating between the E- and M-step. Denote the parameter 
value at iteration m by {p,^"^\ K^™"^) (m=0,l,2,. . . ), where {fi^^\ K^'^'^) are the starting 
values. 

E-Step: Compute Q{fi, K\fi^'^\ K^-""^): 

As the complete penalized negative log- likelihood in (12.141) is linear in Ti and T2, the 
E-step consists of calculating: 

^(jn+i) ^ e[Ti|x„5„^(™),k("^)], T^"+') = E[T2|x,6„//("^),i^(™)]. 

This involves computation of the conditional expectation of Xij and XijXij', i = 1, . . . ,n, j, j' 
1, . . .p. Using formula (|2.13p we find 



Cj if Xij is missing 



where c is defined as 

(m) 



„ ._ ... , _ c/^M (r t, ■ - iS'^A 

^ f^mis y-^^mis^mis) ^^mis,obs y-^obs,'i h^obs J 



Similarly, we compute 



' XijXij' if Xij and Xij/ are observed, 

XijCji if is observed and Xiji is missing, 

Kirulmis) . , + CjCjv if and Xijv are missing. 



,(m) 

and MP^P respectively, such that the obvious indexing makes sense. 



Here the vector c and the matrix (K^l^-^) ^ are regarded as naturally embedded in 



The E-step involves inversion of a sparse matrix, namely K^^^-^, for which we can use 
sparse linear algebra. Note also that is positive definite and therefore invertible. 
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Furthermore considerable savings in computation are obtained if cases with the same 
pattern of missing X's are grouped together. 

M-Step: Compute the updates as minimizer of Q{fj,, K\fj,^"'\ K^"''>): 

It's easily seen from equation (j2.14p that fulfill the following equations: 

,,("^+1) — ^ rp(m+l) 

M — -■- 1 

n 

j^{m+i) ^ argmin|-log|/^l +tr(i^S(™+^)) + — lli^l 
K>-o [ n 

where S("^+i) = ^T^'"^^^ - ;u('"+^)(/^^"^^^)^- Therefore the M-Step reduces to a GLasso 
problem of the form (12. 2p . which can be solved by the algorithm described in Section [2.11 

2.3.3 Numerical Properties 

A nice property of every EM-algorithm is that the objective function is reduced in each 
iteration, 

Nevertheless the descent property does not guarantee convergence to a stationary point. 
A detailed accoun t of the con vergence properties of the EM algorithm in a general setting 



has been given bv lWul (jl983l ). Under mild regularity conditions including differentiability 



and continuity, convergence to stationary points is proven for the EM-algorithm. 

For the EM-algorithm described in S ect ion [2 . 3 . 2 1 which optimizes a non-differentiable func- 
tion we have the following result: 

Proposition 2.1. Every cluster point {p,,K), with K y- 0, of the sequence K^"^^);m 
0, 1, 2, . . .}, generated by the EM-algorithm, is a stationary point of the criterion function 
in (KM- 

A proof is given in the Appendix. 



2.3.4 Selection of the tuning parameter 

In practice a tuning parameter A has to be chosen in order to tradeoff goodness-of-fit and 
model complexity. One possibility is to use a modified BIC criterion which minimizes 

BIC = -2i{fi, k- ^obs) + log(n)df, 

over a grid of candidate values for A. Here {jl,K) denotes the MissGLasso estimator 
()2.9p using the tuning parameter A and df = ^{k ,^0} degrees of freedom 

dYuan and Linl . bon?! ). The defined BI C criterion is based o n the observed log-likelihood 



/C; Xofes) which is also suggested by Ibrahim et al. ( 20081 ). 



Another possibility to tune A is to use the popular V-fold cross-validation method, based 
on the observed negative log- likelihood as loss function. We proceed as follows: First 
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divide all the samples into V disjoint subgroups (folds), and denote the samples in vth 
fold by A'^ for k = 1, . . . ,V . The V-fold cross-validation score is defined as: 

CV{X) = log \{t_y)obs,i\+ '^{Xobs,i- t^obs,iVi(X'-v)obs,i) ^{Xobs,i- l^obs,i) 

where = {K^^)^^ and K^^ denotes the estimate based on the sample {^X'=i-^v')/^v 
Then, find the best A that minimizes CV{\). Finally, fit the MissGLasso to all the data 
using A to get the final estimator of the inverse covariance matrix. 



3 Extension to sparse regression 

The MissGLasso could be applied directly to high-dimensional regression with miss- 
ing values. Suppose a scalar response variable Y is regressed on p predictor variables 
we assume joint multivariate normality for X = (Y, X^^^ , ■ ■ ■ , X^P^) with 
mean and concentration matrix given by 

we can estimate (/i, K) with the MissGLasso. The regression coefficients /3 are then given 
by /3 = —kyykyx- This approach is short-sighted: a zero in the concentration matrix, 
say Kjj/ = 0, means that X^^^ and X^j'^ are conditionally independent given all other 
variables in X, where Y is included in X . But we actually care about the conditional 
independence of X^^^ and X^^'^ given all other variables in X (which does not include Y). 
Sparsity in the concentration matrix of X is not ne cessarily enforced by penaliziri g 
For a more detailed discussion about this issue, see IWitten and Tibshiranil (jioO^). 



We describe in Section 13.21 a two-stage procedure which results in sparse estimates for the 
concentration matrix K oi X and the regression parameters /3. In order to motivate the 
second stage of this procedure, we first introduce a likelihood-based method for sparse 
regression with complete data. 

3.1 ^i-penalization in the regression model with complete data 

Consider a Gaussian linear model: 

Y, = (3^Xi + e^, i = l...n, (3.15) 

i.i.d ~ AA(0,a2), (3.16) 

where Xi € are covariates. 

In the usua l linear regression model, the £i-norm penalized estimator, called the Lasso 
(jTibshiranil (|l99fil )). is defined as 

/3a = argmini||y-x/3[|2 + A|l/3|li, (3.17) 

13 2 
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with n X 1 vector y, p x 1 regression vector (3 and n x p design matrix x. The Lasso 
estimator in (13.171) is not Uk e lihood -based and does not provide an estimate of the nuisance 



parameter a. IStadler et al. (|2009l l suggest to take a into the definition and optimization 



of a penahzed hkehhood estimator: they proceed with the fohowing estimator, 



argmin -^^^<^(y|x) + A- 



argmin nlog(cr) + Trolly - x/3|P + A-!-^ 



(3.18) 



Intuitively the estimator (|3.18p penalizes the ^i-norm of the regre ssion coefficients and 
small variances a simultaneously. Furthermore IStadler et all JioS) argue that this esti- 
mator is equivariant under scaling. Most importantly if we reparametrize p = 1/a and 
(p = /3/a we get the following convex optimization problem: 



lx,px = argmin -nlog(p) + ^llyoy - X(/)|p + A 



(3.19) 



This optimization problem can be solved efficiently in a coordinate-wise fashion. The 
following algorithm is very easy to implement, it simply updates, in each iteration, p 
followed by the coordinates c^j, j = 1, . . . of 0. 

Coordinate-wise Algorithm for solving ()3.19p 

1. Start with initial guesses for 0(0), p{0). 

2. Update the current estimates p^'^^ coordinate-wise by: 



Jm+l) 



where Sj is defined as Sj 



yTj^(p{m) + ^(y2^X(/)M)2 +4y^yn 



2y^y 



if \Sj \ < A 



(A - SjO/xJxj if Sj > A 
-{X + Sj)/xJxj ifSj<-X 



-P 



(™-^^)xJy + E.<, 



,{m+l) f 



Xj Xs + J2s>j 



3. Iterate step 2 until convergence. 



With Xj we denote the jth column vector of the n x p matrix x. This algorithm can be 
implemented very efficiently as it is the case for the coordinate descent algorithm solving 
the usual Lasso problei n. For example naive up date s, covariance updates and the active- 
set strategy described in Friedman et al. ( 2007al ) and Friedman et al. ( 20081 ) are applicable 
here as well. 

Numerical convergence of the above algorithm is ensured as follows. 

Proposition 3.1. Every cluster point {p,(t>) of the sequence {{p^'^\(f)^'^^);m = 0, 1,2, . . .}, 
generated by the above algorithm, is a stationary point of the criterion function in ( 13. j 9)) . 
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A proof is given in the Appendix. 

Note that the algorithm only involves inner products of x and y. We will make use of this 
algorithm in the next section when treating regression with missing values. 

3.2 Two-stage likelihood approach for sparse regression with missing 
data 

We now develop a two-stage £i-penalized likelihood approach for sparse regression with 
potential missing values in the design matrix x. Consider the Gaussian linear model: 

x,~A/-(/x,s), x, = (xf\...,xf))eM^ 

Yi\Xi = p^X^ + e^, eii.i.d~A/'(0,a2) (3.20) 
Xi, ei independent of each other and among i = 1, . . . , n. 

If we assume model (I3.20p it's obvious that (Yi,Xi) follows again a multivariate normal 
distribution. The corresponding mean and covariance matrix are given in the following 
lemma: 

Lemma 3.1. Assuming model \3. 20\) . {Yi,Xi) is normally distributed Af (11, T,) with: 
A proof is given in the Appendix. 

In a first stage of the procedure we estimate the inverse covariance = of X using 
the MissGLasso: 

1st stage: {flx^, K^,) = aigmin K;:s.obs) + M\\K\\i. (3.22) 

{fi,K):K^O 

Let now i{f3,a, K;y,Xohs) be the observed log-likelihood of the data (y,x). In the 
second stage of the procedure we hold fi and K fixed at the values /xai and Kxi from the 
first stage and estimate /3 and a by: 

2nd stage: 0X2,^X2) = arg min (/3, a, /UAi , i^Ai ; y, Xobs) + A2 -■ (3.23) 

Note that we use two different tuning parameters for the first and the second stage, 
denoted by Ai and A2. In practice, instead of tuning over a two-dimensional grid (Ai, A2), 
we consider the 1st and 2nd stage independently. We tune first Ai using BIC or cross- 
validation as explained in Section 12.3.41 and then we use the resulting estimator in the 2nd 
stage and tune A2. 

A detailed description of the EM-algorithm for solving the 1st stage problem was given in 
Section [2321 We now present an EM-algorithm for solving the 2nd stage. In the E-step of 
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our algorithm, we calculate the conditional expectation of the complete-data log-likelihood 
given by 

e{j3, a, flx^ , Kx^ ; y, x) = a; y |x) + £{fiXi , Kx^ ; x) 

oc ^(/3,(T;y|x) oc-nlog(a)-^||y-x/3|p (3.24) 

We see from equation ()3.24p that the part of the complete log-likelihood which depends 
only on the regression parameters /3 and a is linear in the inner products y^y, y^^x and 
x^x. Therefore we can write the E-Step as: 

E-Step: 

Ti"^+^) = E[y^x|y,Xofe„/3(-),a('"),/iA,,i^Aj 

t(™+^) = E[x^x|y,x„fe„/3(-),a('"),AA,,i^A,]. 

These conditional expectations can be computed as in Section 12.3.21 using Lemma 13.11 
In particular, these computations involve inversion of the matrices K^^l. Because of the 
special structure of K^^l , see Lemma 13.11 explicit inversion is possible by exploiting the 
formula {A + hb'^)~^ = — A~^hlF / [1 + b'^ A~^b), where A~^ has been previously 
computed in the first stage. 

Finally, in the M-Step we update the regression coefficients by: 
M-Step: 

argminnlog(a) + ^ + l£^-^^;^^^ + AMk. (3.25) 

If we reparametrize p = 1/a and (/) = /3/(T in (|3.25|) . we see that the M-Step has essentially 
the same form as p.l9p . Therefore, we can use the algorithm described in Section [3. 11 but 
exchanging the inner products y^x and x-^x for x^™'"^^^ and T^2^^^\ 

4 Simulations 

4.1 Simulations for sparse inverse covariance estimation 
4.1.1 Simulation 1 



We consider model 1 and model 2 of Rothman et al. ( 20081 ) with p = 10, 30, 50: xi, . . . , x„ 
i.i.d. ~ M{0, S) with n = 100 and 

Model 1: AR(1), Sj^v = 0.7^^'-^^ 

Model 2: AR(4), = /(|/ - j| = 0) + 0.4/(|/ - j| = 1) + 0.2/(|j' - i| = 2) 

+0.2/(|/-i| =3) + 0.1/(|/-j| = 4). 

For all six settings we make 50 simulation runs. In each run we proceed as follows: 
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(1) We generate n = 100 training observations and a separate set of 100 validation 
observations. 



(2) In the training-set we delete completely at random 0%, 10%, 20%, 30%, 40% and 50% 
of the data. Per model we therefore get six training-sets with different degree of 
missing data. 

(3) The MissGLasso estimator is fitted on each of the six mutilated training-sets, with 
the tuning parameter A selected by minimizing twice the negative log-likelihood 
(log-loss) on the validation data. This results in six different estimators of the 
concentration matrix K. 

We evaluate the concentration matrix estimation performance using the Kullback-Leibler 
loss: 

Akl{K, K) = tr(Si^) - log [Si^l - p. 

As reference we compare the MissGLasso with the estimators when first imputing the 
missing values by their corresponding column means and then apply the GLasso on the 
imputed data. The tuning parameter A for the GLasso is chosen using the validation set 
as described above in (3). We call this procedure Meanlmpute. Furthermore we compute 
for both models in the case where p = 10 the (unpenalized) maximum likelihood estimator 
using an EM-algorithm as implemented in the R-package norm. We denote this estimator 
MLE. 

Results for both covariance models with six different degrees of missingness are summarized 
in Table [H which reports the average Kullback-Leibler loss and the standard deviation. 
For all models and all missingness settings the MissGLasso outperforms the Meanlmpute 
procedure. Further, already for the lowest dimensional case (p = 10) we notice that the 
MLE estimator performs very badly with high degrees of missingness. On the other hand 
our estimator stays stable. 

To assess the performance of MissGLasso on recovering the sparsity structure in K, we 
also report the true positive rate (TPR) and the true negative rate (TNR) defined as 

#true non-zeros estimated as non-zeros 



TPR 

TNR 



T^true non-zeros 
#true zeros estimated as zeros 



T^true zeros 

These numbers are reported in Table [2j For visualization we also plot in Figure [1] heat- 
maps of the percentage of time each element was estimated as zero among the 50 simulation 
runs. We note that our choice of CV-optimal A has a tendency to yield too many false 
positives and thus too low v alues for TNR: in the case without in issing values, this finding 
is theoretically supported in Meinshausen and Biihlmann ( 20061 ). 



Finally, we comment on initialization and computational timings of the MissGLasso. In 
the above simulation we used the Meanlmpute solution as starting values (^O; -K^o) for the 
MissGLasso. For a typical realization of model 2 with p = 30, 30% missing data and a 
prediction optimal tuned parameter A, our algorithm converges in 0.87 seconds and 19 
EM-iterations. All computations were carried out with the statistical computing language 
and environment R. 
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Model 1 




p=10 


p=30 


p=50 


Missing % 


MLE 


Meanimp 


MissGLasso 


MLE 


Mcanimp 


MissGLasso 


MLE 


Meanimp 


MissGLasso 


% 


0.69 


0.36 


0.36 


- 


2.02 


2.02 


- 


4.39 


4.39 




(0.17) 


(0.09) 


(0.09) 


- 


(0.20) 


(0.20) 


- 


(0.34) 


(0.34) 


10% 


0.94 


0.63 


0.41 


_ 


3.19 


2.28 


_ 


6.62 


4.95 




(0.27) 


(0.15) 


(0.11) 


- 


(0.32) 


(0.24) 


- 


(0.44) 


(0.40) 


20 % 


1.38 


1.04 


0.48 




4.73 


2.68 




9.13 


5.73 




(0.45) 


(0.22) 


(0.14) 




(0.40) 


(0.27) 




(0.63) 


(0.48) 


30% 


13.08 


1.71 


0.63 




6.58 


3.23 




12.63 


7.04 




(38.91) 


(0.32) 


(0.17) 




(0.65) 


(0.34) 




(0.96) 


(0.51) 


40 % 


244.65 


2.67 


0.79 




9.79 


4.17 




17.87 


8.77 




(419.59) 


(0.50) 


(0.20) 




(0.95) 


(0.38) 




(1.29) 


(0.67) 


50% 


384.18 


4.50 


1.11 




14.86 


5.61 




27.71 


11.73 




(518.71) 


(1.12) 


(0.28) 




(1.75) 


(0.58) 




(2.52) 


(1.12) 



Model 2 




p=10 


p=30 


p=50 


Missing % 


MLE 


Meanimp 


MissGLasso 


MLE 


Meanimp 


MissGLasso 


MLE 


Meanimp 


MissGLasso 


% 


0.69 


0.52 


0.52 




2.51 


2.51 




4.76 


4.76 




(0.17) 


(0.11) 


(0.11) 




(0.17) 


(0.17) 




(0.22) 


(0.22) 


10 % 


0.89 


0.69 


0.62 




2.84 


2.82 




5.18 


5.26 




(0.23) 


(0.14) 


(0.11) 




(0.19) 


(0.18) 




(0.27) 


(0.24) 


20% 


1.40 


0.96 


0.71 




3.51 


3.18 




6.16 


5.85 




(0.45) 


(0.20) 


(0.13) 




(0.28) 


(0.22) 




(0.36) 


(0.27) 


30 % 


10.37 


1.42 


0.88 




4.63 


3.57 




7.89 


6.49 




(50.97) 


(0.28) 


(0.17) 




(0.23) 


(0.23) 




(0.40) 


(0.26) 


40% 


115.16 


1.93 


0.99 




5.07 


3.96 




9.54 


7.22 




(245.05) 


(0.22) 


(0.16) 




(0.17) 


(0.25) 




(0.44) 


(0.30) 


50 % 


240.76 


2.59 


1.17 




5.99 


4.45 




12.18 


7.82 




(291.52) 


(0.34) 


(0.15) 




(0.31) 


(0.22) 




(0.57) 


(0.30) 



Table 1: Average (SD) Kullback-Leibler loss of MLE, Meanimp and MissGLasso with different degrees of missingness. 



Model 1 




P= 


=10 


P= 


=30 


P= 


=50 


Missing % 


TPR % 


TNR % 


TPR % 


TNR % 


TPR % 


TNR % 


0% 


100.00 


38.33 


100.00 


58.56 


100.00 


66.49 




(0.00) 


(11.94) 


(0.00) 


(3.64) 


(0.00) 


(2.06) 


10% 


100.00 


38.89 


100.00 


59.69 


100.00 


67.40 




(0.00) 


(10.76) 


(0.00) 


(2.90) 


(0.00) 


(1.89) 


20 % 


100.00 


41.61 


100.00 


60.08 


100.00 


67.93 




(0.00) 


(8.53) 


(0.00) 


(2.99) 


(0.00) 


(1.99) 


30% 


100.00 


41.39 


100.00 


60.48 


100.00 


68.54 




(0.00) 


(9.31) 


(0.00) 


(2.67) 


(0.00) 


(1.88) 


40 % 


100.00 


41.14 


100.00 


61.92 


99.92 


69.70 




(0.00) 


(7.21) 


(0.00) 


(2.45) 


(0.32) 


(1.60) 


50% 


99.71 


41.83 


99.72 


62.16 


99.65 


70.79 




(1.41) 


(8.24) 


(0.87) 


(3.17) 


(0.66) 


(1.61) 



Model 2 




P= 


=10 


P= 


=30 


P= 


=50 


Missing % 


TPR % 


TNR % 


TPR % 


TNR % 


TPR % 


TNR % 


0% 


85.89 


30.80 


64.01 


64.94 


56.09 


75.20 




(11.03) 


(18.01) 


(6.54) 


(6.36) 


(3.06) 


(2.85) 


10% 


81.83 


35.33 


59.95 


68.62 


52.47 


77.71 




(13.91) 


(20.30) 


(5.09) 


(5.12) 


(3.41) 


(2.75) 


20 % 


76.91 


39.60 


54.30 


72.53 


48.33 


79.43 




(11.51) 


(17.21) 


(5.57) 


(5.32) 


(3.27) 


(2.94) 


30% 


69.37 


49.33 


51.39 


74.65 


44.61 


82.24 




(11.16) 


(17.30) 


(5.65) 


(5.21) 


(3.99) 


(3.15) 


40 % 


64.11 


51.60 


44.85 


79.51 


38.23 


85.33 




(12.69) 


(19.74) 


(6.18) 


(4.74) 


(5.53) 


(3.58) 


50% 


54.91 


62.40 


36.51 


84.15 


30.32 


89.15 




(13.40) 


(16.75) 


(5.97) 


(4.67) 


(4.23) 


(2.62) 



Table 2: Mean (SD) of True Positive Rate (TPR) and True Negative Rate (TNR) of the 
MissGLasso estimator for inferring the zeros in if = AU numbers are in percentages. 
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o 

CO 



MissGLasso: 0% missing IVlissGLasso: 30% missing 





Figure 1: Heat-maps of zeros identified in the concentration matrix K among 50 simulation 
runs. White color stands for zeros in all 50 simulation runs. Black stands for no zeros in 
all runs. 

4.1.2 Simulation 2: MissGLasso under MCAR, MAR and NMAR 



In the simulation of Section 14.1.11 the missing values are produced completely at random 
(MCAR), i.e. missingness does not depend on the values of the data. As mentioned in 
Section 12.21 the MissGLasso is based on a weaker assumption, namely that the data are 
missing at random (MAR), in the sense that the probability that a value is missing may 
depend on the observed values but does not depend on the missing values. A missing data 
mechanism where missingness dep ends also on the rnissing values is called not missing at 
random (NMAR) , see for example iLittle and RubinI (|l987l ) . In this section we will show 
exemplarily that our method performs differently under the MCAR, MAR and NMAR 
assumption. 

We consider a Gaussian model with p = 30, n = 100 and with a block-diagonal covariance 
matrix 

'B ••• 



B 



'■■ 

B 



B 



1 0.7 0.72 
0.7 1 0.7 
0.72 0.7 1 
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Note that the concentration matrix K is again block-diagonal and therefore a sparse 
matrix. 

We now delete values from the training data according to the following missing data 
mechanisms: 

(1) for all 6 = 1, . . . , 10 and i = 1, . . . ,n: 

Xi^s.b is missing if rji^h = 1, 

where jyj^fe are i.i.d. Bernoulli random variables taking value 1 with probability vr 
and with probability 1 — vr. 

(2) for all 6 = 1, . . . , 10 and i = 1, . . . ,n: 

Xi^^.b is missing if Xi^3.b-2 < T. 

(3) for all 6 = 1, . . . , 10 and i = 1, . . . , n: 

Xi,3.b is missing if Xi^^.b < T. 

In all mechanisms the first and second variable of each block are completely observed. 
Only the third variable of each block has missing values. Mechanism (1) is clearly MCAR, 
mechanism (2) is MAR and mechanism (3) is NMAR. The probability vr and the truncation 
constant T determine the amount of missing values. In our simulation we use three differ- 
ent degrees of missingness: (a) vr = 0.25, T = ^>~i(0.25), (b) vr = 0.5, T = ^>-i(0.5) = 
and (c) vr = 0.75, T = $-^(0.75). Here, $(.) is the standard normal cumulative distribu- 
tion function. Setting (a) results in about 8|%, (b) in 16|% and (c) in 25% missing data. 
In Figure [21 box-plots of the Kullback-Leibler loss over 50 simulation runs are shown. As 
expected we see that MissGLasso performs worse in the NMAR case. The more missing 
data the more pronounced is this observation. 

4.1.3 Simulation 3: BIC and cross-validation 

So far, we tuned the parameter A by minimizing twice the negative log- likelihood (log- 
loss) on validation data. However, in practice, it is often more appropriate to use cross- 
validation or the BIC criterion presented in Section 12.3.41 

Figure [3] shows the Kullback-Leibler loss, the true positive rate and the true negative rate 
for the MissGLasso applied on model 1 with p = 30. We see from the plots that cross- 
validation and tuning using validation data lead to very similar results. On the other hand 
BIC performs inferior concerning Kullback-Leibler loss, but slightly better regarding the 
true negative rate. 

4.1.4 Scenario 4: isoprenoid gene network in Arabidopsis thaliana 

For illustration we apply our approach for modeling the isoprenoid gene network in Ara- 
bidopsis thaliana. The number of genes in the network is p = 39. The number of observa- 
tions, corresponding to different experimental conditions, is n = 118. More details about 
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(n 
o 

_l in 




1 1 r 

MCAR MAR NMAR 




T 1 r 

MCAR MAR NMAR 




T 1 r 

MCAR MAR NMAR 



Figure 2: Kullback-Leibler loss over 50 simulation runs for different missing data mech- 
anisms (MCAR, MAR, NMAR) and different degree of missingness: (a) vr = 25%, 
T = $-1(0.25), (b) vr = 50%, T = 0, (c) ^ = 75%, T = <^-^{0.75). 



the data can be found in IWille et all ^2004 ). The dataset is completely observed. Never- 
theless, we produce completely missing at random values and compare the performance of 
the MissGLasso with the estimator on the complete data. We consider the following two 
experiments. 

First experiment: We create out of the original data ten datasets by producing com- 
pletely at random m% missing values, where m = 0, 5, 10, 15, 20, 25, 30. Then we compute 
for each of these ten datasets the 10-fold cross-validation error by evaluating the corre- 
sponding prediction error (based on out-sample negative log-likelihood) on the left-out 
part of the original (complete) data. The curves in the left panel of Figure H] show the me- 
dian as a summary statistic of the cross-validation error over the ten datasets for different 
tuning parameters A. The percentages on the curves indicate the degree of missingness. 
All curves have about the same shape and indicate an optimal tuning parameter between 
2 and 3. Aside from the curve with 5% missing values, we see that the more missing data 
we produce the worse the cross-validation error curve. 

Second experiment: First we select using the GLasso on the original (complete) data 
(prediction optimal tuned) the twenty most important edges according to the estimated 
partial correlations given by 



Pjj'\rest 



\kjjt 



kjjkjiji 



= 1, • • • 



Then, we create 50 datasets by producing completely at random m% missing values and 
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validation data 



cross-validation 



BIG 




T — I — I — I — I — r 

% 20 % 40 % 



T — I — I — I — I — r 

% 20 % 40 % 



1 — I — I — I — I — r 

% 20 % 40 % 




T — I — I — I — I — r 

% 20 % 40 % 



T — I — I — I — I — r 

% 20 % 40 % 



1 — I — I — I — I — r 

% 20 % 40 % 



Figure 3: KLloss, TPR, TNR of the MissGLasso estimator tuned with either vahdation- 
data, cross-validation or BIC. Model 1 with ^? = 30 and n = 100, based on 50 simulation 
runs. 
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select using the MissGLasso for each of the 50 datasets the twenty most important edges 
according to the partial correlations Pjj'\rest- We do this for m = 0,5,10,15,20,25,30. 
Finally, we identify the overlap of the selected edges without missing values and of the 
selected edges with m% missing data. The box-plots in the right panel of Figure H] visualize 
the size of this overlap. Even with 30% missing data, the MissGLasso detects about 13 of 
the twenty most important edges of the complete data. 

1 0-fold crossvalidation error Size of edge intersection 




1 2 3 4 5 6 7 8 5 % 10 % 15 % 20 % 25 % 30 % 

tuning parameter X degree of missingness [%] 



Figure 4: Left panel: Each curve shows the median of the cross-validation error over ten 
data-sets for different tuning parameters A, where for each dataset randomly a certain 
degree of missingness is produced. The percentage on the curve indicates the degree 
missingness. Right panel: Box-plots of the overlap of the twenty most important edges 
with and without missing values over 50 datasets, where again for each dataset missing 
values are produced. 



4.2 Simulations for sparse regression 

In this section we will explore the performance of the two-stage likelihood method devel- 
oped in Section 13. 2[ In particular, we compare our new method with alternative ways of 
treating high-dimensional regression with missing values. 

In all simulations training- and validation data are generated from a model for (Y,X). 
Assuming that there are missing values only in the x matrix of the training data we apply 
one of the following methods: 

• Mean Imputation: Impute the missing values by their corresponding column means. 
Then apply the Lasso-estimator (I3.17P on the imputed data. 
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• KNN Imputation: Impute the missing values by the K-nea r est ne ighbors imputation 
method [KNNimpute) introduced by Troyanskaya et al. ( 200ll ). Then apply the 
Lasso on the imputed data. 

• MissGLasso Imputation: Compute {fi,K) with the MissGLasso estimator. Then, 
use this estimate to impute the missing values by conditional mean imputation, i.e. 
replace the missing values in observation i by Xmis,i '■— ^\xm,is,i\xohs,iT jj'^ 

K]. Finally, 

apply the Lasso on the imputed data. 

• Two-stage likelihood: This is the method introduced in Section 13. 2i (1st stage: 
solve the MissGLasso problem; 2nd stage: estimate j3 and a by minimizing a 
penalized negative log-likelihood, see equation (I3.23p . where we fixed // and K in 
the likelihood at the values from the 1st stage; initialization of EM with /3 = and 
a"^ = empirical variance of y) 

All methods, except for Mean Imputation, involve two tuning parameter. Regarding the 
first parameter, the number of neighbors in KNNimpute or the regularization parameter 
for the MissGLasso are chosen by cross-validation on the training data. The second 
tuning parameter in the Lasso or in the 2nd stage of the two-stage likelihood approach, 
respectively, are chosen to minimize the prediction error on the validation data. 



To assess the performances of all methods we use the L2-distance between the estimate 13 

\l 



and the true parameter /3, ||/3 — /3||? 



4.2.1 Simulations 

Consider the Gaussian linear model 

Yi = 0^X, + ei, i = l,...,n, (4.26) 
ei,...,e„ i.i.d ~ A/'(0,a^), 

where the covariates Xi € W, i = 1, . . . ,n, are either fixed or i.i.d ^ A/'(0, S). 

First experiment: Take p = 8, Sjj/ = rl^-^'l and /3 = (3, 1.5, 0, 0, 2, 0, 0, 0). 

We focus on four different versions of this model with different combinations of n/r/a, 
namely 20/0.5/3; 40/0.5/1; 40/0.95/1; 100/0.5/0.5. Note that with value s n/r/a 



20/0 .5/3 we have the model which was considered in the original Lasso paper (jTibshiranil . 



The box-plots in Figure [5] of the L2-distances, summarize the performance of the different 
methods for different combinations n/r/a. In this experiment 20% of the training data 
were deleted completely at random. For reference, we added a box-plot for the L2-distances 
for the Lasso carried out on complete data, i.e. before deleting 20% in the training data. 

For the model from the original Lasso paper, namely the combination 20/0.5/3, we see 
that the Lasso on complete data does not perform substantially better than simple mean 
imputation on data with 20% of the values removed. This is due to the high noise level in 
this model. By increasing n and/or scaling down a, we reduce the noise level and increase 
the signal in the data. Indeed, in the setup 40/0.5/1, the analysis with complete data 
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n=20, T=0.5, a=3 



n=40, T=0.5, a=1 




complete mean knn missgl two-stage complete mean knn missgl two-stage 



n=40, i:=0.95, o=1 n=1 00, i:=0.5, o=0.5 




complete mean knn missgl two-stage complete mean knn missgl two-stage 



Figure 5: Box-plots of the L2-distances for different values for n, t and a over 50 simulation 
runs with 20% of the training data deleted completely at random. Complete: Lasso 
on complete data (before deleting 20% of the data). Mean: Mean imputation followed 
by the Lasso. Knn: KNN imputation followed by the Lasso. Missgl: MissGLasso and 
conditional mean imputation followed by the Lasso. Two-stage: Two-stage likelihood 
approach introduce in Section [3.21 

performs now much better than all analyses carried out on data with missing values. We 
also see that the two-stage likelihood method is slightly better than the other methods. In 
the setup 40/0.95/1 we increase the correlation between the covariates by setting r from 
0.5 to 0.95. Here the new method is as good as the complete analysis. Finally in the last 
setup, 100/0.5/0.5, where n is increased and a is reduced again, the two-stage likelihood 
method is much better than the other methods. Thus, for the cases considered where 
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missing data imply a clear information loss (e.g. when the difference between complete 
and mean imputed data is large), the new two-stage procedure is best. 

Second experiment: Consider the following models: 

• Model 1: n = 100; p = 50; Sjj/ = 0.5 x ljj'<9 for j ^ j', and = 1; /3j = 2 for 
j = 1, . . . , 8 and zero elsewhere; a = 1. 

• Model 2: n = 118; p = 39; x: data from isoprenoid gene network in Arabidopsis 
thaliana (see Section [4.1.4|) : j3j = 2 for j = 1, 2, 3 and zero elsewhere; a=\. 

For both models we delete 10%, 20% and 30% of the training data completely at random. 
The results (L2-distances) are reported in Table El We read off from this table, that the 
two-stage likelihood method performs best in both models. 





mean 


knn 


missgl 


two- stage 


Model 1 10% 
20% 
30% 


0.90 (0.51) 
2.34 (1.56) 
3.09 (1.51) 


0.71 (0.43) 
1.80 (1.03) 
2.74 (1.52) 


0.57 (0.35) 
1.39 (0.88) 
2.14 (1.18) 


0.45 (0.26) 
1.08 (0.60) 
1.60 (0.70) 


Model 2 10% 
20% 
30% 


1.85 (1.09) 
3.32 (1.31) 
4.62 (1.73) 


0.73 (0.54) 
1.67 (1.19) 
2.70 (1.29) 


0.52 (0.41) 
0.97 (0.69) 
1.60 (1.01) 


0.50 (0.31) 
0.88 (0.68) 
1.53 (0.75) 



Table 3: Average (SD) L2-distance of mean, knn, missgl and two-stage with different 
degrees of missingness. 



5 Discussion 

We presented an ^i-penalized (negative) log-likelihood method for estimating the inverse 
covariance matrix in the multivariate normal model in presence of missing data. Our 
method is based on the observed likelihood and therefore works in the missing at random 
(MAR) setup which is more general than the missing completely at random (MCAR) 
framework. As argued in Section 14.1.21 the method cannot handle missingness pattern 
which are not at random (NMAR), i.e. "systematic" missingness. For optimization, we 
use a simple and efficient EM-algorithm which works in a high-dimensional setup and which 
can cope with high degrees of missing values. In Section [3l we extended the methodology 
for high-dimensional regression with missing values in the covariates. We developed a 
two-stage likelihood approach which was found to be never worse but sometimes much 
better than K-nearest neighbors or using the straightforward imputation with a penalized 
covariance (and mean) estimate from incomplete data. 

A Proofs 

Proof of Proposition [2A[ Denote by /c(x|//,i^) the multivariate Gaussian density of the 
complete data. fobsi^obslfJ-TK) the density of the observed data. Furthermore, the condi- 
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tional density of the complete data given the observed data is 

k{x.\Xobs,tJ',K) = fc{x\fI,K)/fobs{'Xobs\fJ',K). 

The penaUzed observed log-hkeUhood (j2.10p fulfills the equation 

- ipenil^, K) = - log fobs{^obs\l^. K) + A| \K\\i = Q{fi, K\fi' , K') - H{fi, K\f,', K')IA.27) 
where 



Q{^l,K\^l',K') 

H{fi,K\fi',K') 



-E[£{fi,K;^)\^obs,l^',K'] + \\\K\\, 
-E [log k (x I Xob^ , /x , /sT) I Xob3 , /x' , K'] . 



(A.28) 



By Jensen's inequality we get the following important relationship: 

H{fi,K\fi',K') > Hip,',K'\f,',K'), 

see also Wu ( 19831 ) . ipenifJ-, K), Q{fi, K\fi' , K') and HifJ-, K\fj,' , K') are all continuous 
functions in all arguments. Further, H{p^, K\ij! , K') is differentiable as a function of (/i, K). 
If we think of Q{fj,, K\fi' , K') and Hip, K\fi' , K') as functions of (fJ-jK) we write also 
Q{t,',K'){lJ',K) and /C). 

Let 6*™ = (;u(™), K^'")) be the sequence generated by the EM-algorithm. We need to prove 
that for a converging subsequence 9'^^ — )■ 6 jj — )• o o ) the directional derivative i'peni^'i d) is 



bigger or equal to zero for all directions d (jTsene] (|200lh ). Taking directional derivatives 
of equation ()A.27p yields 

-epen{e;d) = Q'^{9) - Hsie),d). 



Note that Hfj{9) = as Hg{x) is minimized for x = 9 (equation (jA.28p ). Therefore it 
remains to show that Q'q{9) > 0. From the descent property of the algorithm (equation 
(|X27ll and (|X28]l ) we have: 



^pen(9 ) ^ f-penifi ) ^ 



> 



> 



'^peny 



Equation ()A.29P and the converging subsequence imply that {^pen(^™); 
converges to lpen{9)- Further we have : 



m 



(A.29) 
0,1,2,...} 



< 



pen\ 



^""^^ _j_ e (arn+1 



'^peny 



) + HQm{9"^) — Hg 



<0 



-pen\ 



(A.30) 



.(0)=o 



The first inequality follows from the definition of the M-step. We conclude 



')-Qe 



> 0. 



(A.31) 



In each M-step we minimize the function Qgrn^x) with respect to x. Therefore we have: 
Qgn., (0-^+1) - g,™, (^-^ ) + Q,™, (0-^ ) < Qg^,{x) . (A.32) 



^0 HAMl 
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Using continuity, equation ()A.3ip and equation ()A.32p we get 

QM < Qeix) Vx 
and therefore we have proven that Q'AO] d) >Q for all directions d. 



□ 



Proof of Proposition \cl.l[ The result follows from Proposition 5.1 and Lemma 3.1 in lTseng 



(|200lh . □ 



Proof of Lemma I3.il We have 

('■•^■' ((»■")■( e)) (^.) = (J T)(x.)' (A-^=> 

From ()A.33P we see that the joint distribution of {Yi,Xi) follows a (p+l)-variate normal 
distribution with mean and covariance given by 

The expression for the concentration matrix K = can be derived by using the identity 
tk = 1. □ 
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